There is growing evidence for the involvement of Y-complex nucleoporins (Y-Nups) in cellular processes
beyond the inner core of nuclear pores of eukaryotes. To comprehensively assess the range of possible 
functions of Y-Nups, we delimit their structural and functional properties by high-specificity sequence 
profiles and tissue-specific expression patterns. Our analysis establishes the presence of Y-Nups across 
eukaryotes with novel composite domain architectures, supporting new moonlighting functions in DNA 
repair, RNA processing, signaling and mitotic control. Y-Nups associated with a select subset of the 
discovered domains are found to be under tight coordinated regulation across diverse human and mouse cell 
types and tissues, strongly implying that they function in conjunction with the nuclear pore. Collectively, 
our results unearth an expanded network of Y-Nup interactions, thus supporting the emerging view of the 
Y-complex as a dynamic protein assembly with diverse functional roles in the cell. 

Coat nucleoporins form the inner core of nuclear pores of eukaryotes, protein supercomplexes responsible
for the regulated transport of macromolecules between the nucleus and the cytoplasm. The Y-shaped
Nup84/Nup107-160 subcomplex (Y-complex) forms the outer ring scaffold, is evolutionarily conserved,
and is composed of key proteins referred to as outer ring coat nucleoporins (Y-Nups; 9 in vertebrates and
7 in yeast) 1,2, with common structural features yet elusive sequence similarities 3,4. While the functional
capacities of coat nucleoporins are primarily connected with the nuclear pore, and despite their key role in
maintaining the integrity of the outer ring, there is growing evidence for their involvement in other processes 5,6,
including mitotic spindle assembly 7 and transcription regulation 8. Few other nucleoporins bind directly to the
outer rings, rendering the experimental detection of their interacting partners highly challenging 1.

In order to identify potential novel functional associations for coat nucleoporins, we thus set out to characterize 
the nine families of Y-Nups across eukaryotes. In particular, we examined in detail their multi-domain architectures 
and their membership in co-expression groups in human and mouse that further support functional interactions. 
By integrating information from these analyses with previous knowledge, we significantly extend the emerging
evidence for Y-Nup roles outside the nuclear pore, defined as 'moonlighting' roles in this broader context 9 . 

Results 

To augment the limited set of known protein associations for Y-Nups across eukaryotes 10, we deploy
computational and experimental sequence analysis involving extensive sequence comparisons, RNA-seq expression
profiling across diverse human and mouse tissues, protein domain detection and inference of protein
interactions 11, using the Drosophila melanogaster Y-Nups as queries. Using established protocols for low-complexity
masking, sensitive iterative sequence profile searches, consistent labeling and annotation of homologs, automated
sequence clustering and visualization of sequence similarity, we unambiguously assign the initially detected
homologies (Supplementary Fig. 1) into nine Y-Nup families (Figure 1). In the resulting multiple sequence
alignments, certain members share less than 10% identity with their homologs (p < 10^-4, see
Methods and Supplementary Fig. 2) 12.
Figure 1 | Automated clustering of validated Y-Nup family relationships. Left: Circos Tableviewer (http://mkweb.bcgsc.ca/tableviewer/) representation
(see Methods) of the nine automatically generated clusters (C1-C9) and the membership of the detected Y-Nup homologs from iterative sequence profile
searches (named in alphabetical order, displayed in counterclockwise fashion). Stripes are color-coded according to clusters (for example, C1 is shown in
red, spanning three Y-Nup homologs, namely Nup37/SEH1/SEC13); there is a one-to-one correspondence between clusters and homologs except for C1
(above), Nup75 (C6, C7) and Nup107 (C3, C9). Outer circle values represent relative contributions, inner circle values correspond to absolute
numbers. Right: Depiction of sequence alignments derived from profile searches (see Data Supplement DS03). Y-Nup families are assigned to the color of
their highest-frequency cluster. Full-sized sample alignments are provided for Nup107 (minimum identity 4%, Supplementary Fig. 3) and Nup133
(minimum identity 9%, Supplementary Fig. 4). Note that Nup37 (in C1) and Nup43 (C8) detect fewer homologs due to their restricted phylogenetic
distribution.



Clustering of Y-Nups into protein families. We identified ~3000 proteins as Y-Nups (Supplementary Table 1),
many of which are reported here for the first time, especially for lower eukaryotes, including the previously
undetected presence of Nup43 in fungal species (see Supplementary Text). These results confirm the
universal distribution of the Y-Nup nuclear pore subcomplex in eukaryotes 13. In particular, it is noteworthy
that many of the protein sequences we detect here have not been reported previously in annotation efforts,
due to the presence of subtle sequence similarities that are confounded by extensive low-complexity regions
or repeats (e.g. WD40): of the 2962 entries in the resulting Y-Nup compendium, there are 1813 characterized
and 1149 newly discovered Y-Nups, thus increasing the level of characterization by more than 63%.
It should be pointed out that without the detection of low-complexity, compositionally biased regions,
the majority of these similarities are lost, mostly due to the presence of WD40 repeats, particularly
for shorter query protein sequences. Based on our workflows (see Methods and Data Supplements DS01-09),
we assigned detected homologs automatically into similarity clusters 12 following detailed validation,
essentially replicating our manual characterization in a highly consistent, reproducible manner.
Of the 22,033 off-diagonal hits (i.e. excluding self-hits) in an all-against-all sequence comparison,
5,403 (24.5%) and 557 (2.5%) Y-Nup pairs exhibit pairwise sequence identity <30% and ≥80% respectively
(Supplementary Fig. 2). The nine independently derived clusters detected by the automated procedure
correspond to all known classes of Y-Nups, with the Nup37/SEH1/SEC13 families merging into the largest
group (1077 members, minimum identity 7%), while the two smallest clusters represent distant sub-families
of Nup75 (70 members from Ascomycota, minimum identity 8%) and Nup107
(10 members from Trypanosomatidae) (Figure 1).
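
The percentages quoted in this paragraph follow directly from the reported counts. The short sketch below simply re-derives them as a consistency check; the variable names are ours and purely illustrative, and the label "high identity" for the 557-hit class is our reading of the text.

```python
# Counts taken from the paragraph above; variable names are illustrative only.
characterized = 1813
newly_discovered = 1149
compendium_size = characterized + newly_discovered

# Relative gain in characterized Y-Nups contributed by the new entries.
increase_pct = 100.0 * newly_discovered / characterized

# All-against-all comparison: off-diagonal hits and the two identity classes.
off_diagonal_hits = 22033
hits_below_30_identity = 5403
hits_high_identity = 557  # second identity class reported in the text

low_identity_pct = 100.0 * hits_below_30_identity / off_diagonal_hits
high_identity_pct = 100.0 * hits_high_identity / off_diagonal_hits

print(compendium_size, round(increase_pct, 1),
      round(low_identity_pct, 1), round(high_identity_pct, 1))
```

The totals agree with the compendium size of 2962 entries and with the rounded percentages given in the text.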

Coordinated tissue-specific gene expression. The rigorous
delineation of Y-Nup structural features drawn from evolutionary
relationships and multi-domain architecture is a prerequisite for the
inference of genome-wide functional relationships at both the gene
expression and protein interaction levels 14. We thus examine tissue
specificity patterns of Y-Nup gene expression 15, via RNA-seq data across
a wide range of tissues and cell lines in human and mouse 16
(Figure 2). There is a remarkable consistency of Y-Nup expression
patterns across the two species 17 (see Methods), the most
prominent feature being a detected over-expression of Nup98 (Nup98-96)
and SEC13 in testis. Also, SEC13 is more highly expressed than SEH1 in
muscle, liver, kidney, heart and neural tissue; Nup43 and
Nup37 are significantly more expressed in mouse than in human
testis; and mouse SEC13 has a higher expression in heart tissue
compared to human (Figure 2). Exon skipping is found to be
limited, with subtle tissue specificity patterns and minor alternative
exon splicing events for Nup98 (Nup98-96) observed in both species
(not shown), indicating a tight, evolutionarily conserved regulation
at the transcriptional level (Figure 2).

Having established precise protein family relationships across Y-Nups
and their coordinated gene expression patterns in two mammalian
species, we then proceeded to the identification of domain
associations and the extraction of their corresponding expression
profiles. Domain associations can be used to infer the range of
cellular functions that certain Y-Nup subunits might be performing,
previously undetected by more traditional approaches 18. These



SCIENTIFIC REPORTS | 4:4655 | DOI: 10.1038/srep04655








Figure 2 | Gene expression patterns of Y-Nups in human (left) and mouse (right) across seven representative tissues based on RNA-seq profiles. Circos 
Tableviewer (http://mkweb.bcgsc.ca/tableviewer/) representation as in Figure 1. Y-Nups are labeled by their gene names in the corresponding species (as 
in Figure 1, except NUP85/Nup85 equivalent to Nup75). Tissue labels are self-explanatory (WT signifying wild-type for mouse, -P signifying a single 
tissue sample). Note that SEC13 exhibits a much higher expression level than SEH1, possibly due to its participation in other macromolecular complexes.
The full RNA-seq patterns across a wide range of tissues and cell lines (see Methods) are provided (Data Supplement DS10). 



implied moonlighting functions 19 for the homologous single-domain
counterparts strongly point to the association of the Y-complex with
other fundamental, yet transient, processes at given time points
during the cell cycle 20 and nuclear pore reorganization 21,22. In fact, as
mentioned above, the presence of common repeat patterns in Y-Nups
has occasionally confounded their detailed structural and
functional characterization 23, which is delineated with greater accuracy
in this study.

Multi-domain architectures of Y-Nups. Following the above reasoning,
we are thus able to detect 27 novel multi-domain architectures
for Y-Nups (Supplementary Table 2), using an adaptive length
threshold for the manual inspection of thousands of sequence
alignments (see Methods), which in principle might involve genuine
domain associations for Y-Nups 11,18. These domains correspond
to a wide range of functional categories not directly related to
nuclear pore formation, and thus warrant further investigation, using
criteria for genome structure, gene expression and phylogenetic
distribution. To validate the detected associations, we first
performed genomic sequence comparisons, using linker sequences
of the corresponding multi-domain molecules as queries against genome
and expression nucleotide sequence databases (see Methods and
Data Supplements DS05-06): eight cases are supported by these
exhaustive genomic searches (Supplementary Table 2, 'by Genome').
Although all homologs derive from complete genome
sequences or assemblies (not shown), represented by over
300,000 genes, there are quality issues that require independent
experimental confirmation. We subsequently validate these
architectures using the homology-based RNA-seq expression data from
human and mouse (Supplementary Table 3): six cases are supported
by this extensive genome-wide coverage (over 4 billion reads per
species, Supplementary Table 4), across tissues and cell lines
(Supplementary Table 2, 'by Expression'). Genes that display
coordinated expression across diverse cell and tissue types tend to
share common functions, and the property of co-regulation has been
used to predict gene function: herein, we use coordinated gene
expression patterns as an additional level of validation for the
discovered domain associations. Remarkably, while there are three cases
supported by both genomic and expression evidence, there are
another three cases supported by either of the above, as well as
presence in multiple species ('by Frequency') (Supplementary
Table 2). While cases with variable support will require further
experimental probing, six strongly supported cases (Table 1) can
be unambiguously connected with coat nucleoporin function
(Figure 3): five of these are found in more than one species.

Given the scarcity of known functional relationships for Y-Nups,
partly due to technical limitations, the detection of novel
genome-wide associations can expand their possible roles beyond the nuclear
pore 6, to include transient processes rarely detectable by targeted
experiments. Thus, when validated by exhaustive functional genomics
evidence, the inferred associations pointing to moonlighting
roles of Y-Nups are highly consistent with the limited experimental
evidence available both for gene expression (Figure 4) and protein
interactions (Figure 5), in the broader context of biological processes
as indicated by the associated domains (see also Data Supplement
DS11). Beyond nuclear pore formation and maintenance 5 7 , the
Y-Nups found associated with the strongly supported architectures
(Figure 3, Table 1) can be linked to cellular processes, some of them
previously reported, involved in RNA processing and transport
(cf. Rae1 24), DNA repair 25 (cf. RAD52 26), chromosome maintenance
(cf. Sir4p 27) and centrosome control 28 (cf. Cenp-F 29).

Certain domain configurations with limited support might be due
to sequencing artifacts, gene prediction or short-read assembly
errors. Of those, four cases deserve further discussion although they
are not admitted in our final list. The association of Nup75 (Nup153)
from Naegleria gruberi (GI:290983204) with FG-repeats 30 might
represent a genuine case (see Supplementary Text). Another intriguing,
low-support architecture is an association of SMC domains 6
(condensins) with Nup75 of Chlorella variabilis (GI:307108886): both
pairwise correlations (Figure 3B) and rank correlation clustering
(Figure 4) indicate a co-expression of human paralogs with Nup75,
SMC1A being the most Nup75-coordinated paralog across human
Table 1 | Moonlighting functions for Y-Nups indicated by functionally diverse domains (column labels: Nucleoporin - name of Y-Nup;
Y-Nup GI - representative composite protein identifier; Interaction partner - name of protein domain; Domain GI - representative
single-domain protein identifier; Function - general function of associated domain; Biological process - Gene Ontology category identifier,
representing functional association of corresponding domain).

Nucleoporin  Y-Nup GI   Interaction partner         Domain GI   Function                      Biological process
-----------  ---------  --------------------------  ----------  ----------------------------  ------------------
NUP160       322708659  RAD51                       322698012   DNA repair                    0006281
NUP160       166240053  CcmE/CycJ domain †          118587747   cytochrome C biogenesis       0017004
NUP98        342873147  SET domain                  302917798   chromatin regulation          0016570
NUP43        345482402  DHX15 helicase †            332019512   mRNA processing               0006397
SEH1         320581285  TAF9/Chs5p-Arf1p †          EFW95506.1  transcription/protein export  0006352/0015031
SEH1         33086682   Centrosomal protein 192 †§  351712025   mitotic control               0007051/0007098

Superscript symbols signify database records with a missing Y-Nup annotation (†) or the detection of the corresponding
domain (§) in the Y-Nup composite protein (Y-Nup GI); all other cases can be regarded as correctly annotated database entries.


tissues (Supplementary Table 3); independent observations from
stem cell Oct4 interactions provide additional evidence 31, although
the particular Chlorella instance will need to be further validated. The
third case, with no counterpart elsewhere, is the co-occurrence of
Nup107 with DUF1767 (domain of unknown function) in the
flatworm Clonorchis sinensis (GI:358337287); moreover, DUF1767 is
found in Rmi1, a protein controlling genome stability in yeast 32,
and exhibits the highest pairwise correlation of coordinated gene
expression with Nup107 (Figure 3B). Finally, while not adequately
supported, the co-occurrence of acetyl-CoA carboxylase with Nup75
in Rhodotorula glutinis (GI:342319109) provides clues for a
suspected role of lipids in nuclear pore formation 33,34. While all other
cases are indeed tantalizing (including, e.g., aminopeptidase 35), we
conclude that more experimental and phylogenetic evidence is
required and thus do not deem them strong candidates for
functional association with Y-Nups.

Validation and discovery of Y-Nup moonlighting functions. 

Strong functional genomics evidence for association with Y-Nups
is detected for six domains (Table 1). Using the enriched Y-Nup
group discovered by tissue-specific expression (middle block in
light green, Figure 4), further substantial support for the six novel
discoveries is obtained from high-throughput experiments (Figure 5),
via a composite query to GeneMANIA 36 (see Methods). By
partitioning this network into two sub-networks, with the known cases
and discovered multi-domain architectures deemed as positives (25
in number, average network connectivity 18) and all other nodes as
negatives (depicted in light blue and grey, 52 in number, average
network connectivity 12), the inferred nucleoporin-induced network
exhibits a striking difference in topological complexity, thus placing
the newly discovered multi-domain architectures pointing to
moonlighting roles into a functionally coherent context.
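
The comparison of positives and negatives above can be illustrated with a small sketch. It assumes that "average network connectivity" means the mean number of distinct neighbours per node, which is our interpretation and not stated in the text. The edge list, the placeholder names GENE_X and GENE_Y, and the function `average_degree` are hypothetical and are not taken from the GeneMANIA output.

```python
from collections import defaultdict


def average_degree(edges, nodes):
    """Mean number of distinct neighbours for the given nodes (self-loops ignored)."""
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue
        neighbours[a].add(b)
        neighbours[b].add(a)
    nodes = list(nodes)
    if not nodes:
        return 0.0
    return sum(len(neighbours[n]) for n in nodes) / len(nodes)


# Toy network with made-up edges, for illustration only.
edges = [
    ("NUP160", "NUP107"), ("NUP160", "NUP133"), ("NUP107", "NUP133"),
    ("NUP160", "RAD51"), ("NUP107", "RAD51"), ("RAD51", "GENE_X"),
    ("GENE_X", "GENE_Y"),
]
positives = {"NUP160", "NUP107", "NUP133", "RAD51"}
negatives = {node for edge in edges for node in edge} - positives

avg_positive = average_degree(edges, positives)
avg_negative = average_degree(edges, negatives)
print(avg_positive, avg_negative)
```

In this toy case the positive set is more densely connected than the rest, which is the kind of contrast reported for the 25 positive and 52 negative nodes.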

The RAD51-Nup160 composite protein found in two fungal species,
Metarhizium anisopliae ARSEF 23 (GI:322708659) and
Phaeosphaeria nodorum SN15 (GI:169623440), annotated automatically
in the corresponding sequence records, is strongly supported
by gene expression data for human tissues, tightly coordinated not
only with Nup160 but also Nup107, Nup133 and Nup43 (Figure 4).
Interestingly, this association is also observed as a tandem gene cluster
in Fusarium oxysporum lycopersici supercontig 2.1 (genes Foxg
00234/5: https://img.jgi.doe.gov/cgi-bin/imgm_hmp/main.cgi?section=ScaffoldGraph&page=alignment&scaffold_id=2507525031,
supercontig_2.1&coord1=779427&coord2=779510), further detected
as a conserved pattern in multiple species where Nup160 remains
unidentified (see https://img.jgi.doe.gov/cgi-bin/imgm_hmp/main.cgi?section=GeneNeighborhood&page=geneOrthologNeighborhood
Figure 3 | A) Multi-domain architecture of representative Y-Nup homologs. Y-Nup domains identified by sequence searches (shown in light blue) are
found to be associated with protein domains with functions indicating moonlighting roles (shown in green). Grey boxes signify other protein regions.
Scale is provided above and below the multi-domain diagram representations. Only the six cases with highest support (Supplementary Table 2) are shown
(see also Table 1). A full characterization of the 27 validated domain associations is also provided (Supplementary Fig. 5). The phylogenetic distribution of
all 27 reported domains is available in Data Supplement DS08. B) Coordinated RNA-seq expression patterns across twenty representative human tissues
between Y-Nups and other domains supported by expression. Pearson correlation coefficient values across the twenty tissues are shown on the
x-axis (see Methods). Gene names for human (left) and the encoded labels in this study (right) are shown on the y-axis (Supplementary Table 3),
separated by a vertical bar. Full clustering analysis with Spearman rank correlation coefficients supports nine of the eleven cases (green bars) shown here,
i.e. all listed genes except TDRD3 and TAF9B (middle block, in Figure 4).
Figure 4 | Clustering of human gene expression patterns across multiple tissues and cell lines. Gene expression affinities are represented in a heat map by
Spearman rank correlation coefficients as similarity measure across tissues and cell lines (see Methods). The narrow strip on both sides of the map 
indicates the entries corresponding to Y-Nups (red), homologs of associated domains (blue) - including human paralogs (Supplementary Table 3) found 
in other species (Supplementary Table 2), and a randomly selected set of 300 genes (grey); see also Methods. By splitting the clustering dendrogram at the 
second bifurcation, three clusters emerge depicted by three major blocks (color-coded, left side). The middle block (light green) is enriched in Y-Nups 
except SEC13, the upper cluster (light blue) contains SEC13, while the lower cluster (light red) does not contain any known Y-Nups. The inset (right side) 
magnifies entries in the middle block, where Y-Nups are included: gene names are shown, color scheme as for entire strip; green labels signify genes 
encoding for Y-Nup associated domains, supported by the bootstrap analysis (Supplementary Table 3). This figure is available at higher resolution as 
Supplementary Figure 6. 



&gene_oid=2508346114&show_checkbox=1&cog_color=yes&use_bbh_lite=1).
Additional experimental evidence is provided by Oct4
interactions (e.g. RAD50) 31 as well as DNA damage response (DDR)
studies, e.g. an siRNA-based microscopy screening of ionizing
radiation responses, pointing out the critical role that nucleoporins
might play in genome maintenance 37. The adjacent configuration
of Nup160-CcmE in four uncharacterized proteins of Amoebozoa
(Figure 3A), Dictyostelium discoideum AX4 (GI:166240053), D.
purpureum (GI:330796511), D. fasciculatum (GI:328873820) and
Polysphondylium pallidum PN500 (GI:281210825), provides strong
comparative genomics evidence, in the absence of solid expression
data: the role of CcmE in this context remains unknown at present 38.
NUP98 (Nup98-96) is found in association with a SET domain in
three fungal species, namely in an annotated protein of Fusarium oxysporum
Fo5176 (GI:342873147), and two uncharacterized proteins in
Metarhizium acridum CQMa 102 (GI:322698664) and M. anisopliae
ARSEF 23 (GI:322711125). Curiously, a similar configuration is
found in patients with acute myeloblastic leukemia, with the N-terminal
NUP98-MLL fusion protein acquiring an H3K4 methyltransferase
ability through the SET domain present in MLL 39. Similar
observations support the association of SET with Nup98 40, for instance the
fusion of Nup98 to NSD1 (another SET-containing histone
methyltransferase) 41. The Nup43-DHX15 helicase association found in
uncharacterized proteins of multiple insect species, for instance
Nasonia vitripennis (GI:345482402), is consistent with the presence
of a Werner helicase interacting protein in the Y-complex 42 and
DDX10 in leukemia 43, while it is also detected in Oct4 interactions
along with Nup43 31 and very strong correlations with multiple Y-
Figure 5 | An expanded network of Y-Nup interactions from high-throughput experiments and the discovered multi-domain architectures. Using as
queries the human gene symbols of the molecules with coordinated gene expression patterns from the Y-Nup enriched middle block (see Figure 4), the
resulting network is extracted by GeneMANIA with default parameters (only co-expression, physical and genetic interaction networks are retained).
Queries were 60 (of which 3 are not found as interacting), depicted as diamonds (57 in total). An additional 20 genes are discovered by GeneMANIA,
depicted as circles. The total genes in this network amount to 77, encompassing a number of known associations of Y-Nups (shown in purple, 8 in
number) with other nucleoporins (shown in pink, 4 in number - including Sec13, not in query), e.g. TPR 58, and related molecules (shown in light blue, 35
in number), e.g. EXO1 59. The molecules reported by GeneMANIA include interactions from large-scale experiments not further discussed (shown in grey,
17 in number). The three coordinated gene expression instances regarded as negatives in this study (shown in cyan, 3 in number) are ARG1 (curiously
reported by GeneMANIA, thus shown as circle), ACACA and PGM2. The query molecule C2orf34 (synonym: CAMKMT, thus shown as diamond) is also
reported by GeneMANIA. The six discovered novel domain associations (shown in purple, 10 in number) include five of the six molecules with
highest support (except CcmE/CycJ, Table 1) and five SMC paralogs (SMC1A, SMC2-4, SMC6), not previously found in association with Y-Nups.
GeneMANIA reports no evidence for the association of NUP160-RAD51, NUP98-SET, NUP43-DHX15 and SEH1-TAF9, while providing strong
evidence for SEH1-CEP192 60-62. Common genes between PINA and GeneMANIA include other nucleoporins (e.g. NUP93) or others (e.g. EEF1G,
Elongation factor 1-γ) (see Methods). The annotated layout and GeneMANIA results with supporting literature are available in Data
Supplement DS11.



Nups (Supplementary Table 3, Figure 4). Most importantly, the
interaction of DEAD-box helicases with other nucleoporins, for instance
Ddx19 with Nup159, has been reported at the molecular level 44. The
association of the TAF9 domain, linked to the Chs5p-Arf1p-binding domain,
with SEH1 in Ogataea parapolymorpha (GI:320581285) is supported
by coordinated expression in human (TAF9, Figure 4) and the known
involvement of TAF9 in the SAGA complex for chromatin
remodelling 6. Finally, centrosomal protein 192 (CEP192 in human), with a
role in both centrosome maturation and spindle assembly 45, is
detected at the C-terminus of SEH1-specific WD40 repeats in
multiple vertebrate species including rodents and marsupials, with high
support by correlation clustering (Figure 4) and the presence of
CEP192 with other centrosomal proteins and - curiously - Nup160
(figure 2 of cited work) 28. Remarkably, this gene pair is also conserved
in tandem organization across several vertebrate genomes (not
shown). A set of complex patterns of variable functions is thus
suggested by domain association analysis and validation (Table 1).



Discussion 

We have demonstrated the presence of particular domains with a
wide range of functional roles in four Y-Nup instances (Table 1),
indicating the association of those domains with the nuclear pore as
unraveled by functional genomics evidence and evolutionary
conservation. While issues of sequencing or assembly artifacts remain a
possibility and will pose a continuing challenge for whole-genome
analysis of this kind, there is strong evidence supporting our findings
in recent experimental studies 5,27. In this work, we encountered such
issues arising from short-read genome assemblies, which required the
use of independently derived information to support domain
association analysis: our approach can thus be regarded as a proposed
framework for function prediction, which could be further
automated and made available to the wider community. In particular,
comparative genomics reveals the extent to which the discovered
domain relationships are conserved and can point towards
species-specific adaptations rather than artifacts. These instances can be
assessed experimentally in situ, with advances in novel imaging and
molecular technologies 46: indeed, further experimental analysis will
shed light on these multi-domain associations.

Our analysis exemplifies how genome sequence and functional
genomics data can be coupled to unravel intricate associations of
key supramolecular complexes known to defy biochemical
characterization at present. Although our results cannot prove the
discovered associations definitively, they direct future experimental
efforts. As recently articulated, domain association inference (if
properly executed) can yield low-coverage yet high-precision
functional relationships and might supplement interaction proteomics 18.
Herein, we substantially augment the set of known interactions for
Y-Nups, contributing evidence for new instances of functionally diverse
molecules that are omnipresent in different taxonomic categories.
Our results indicate that the structural and functional
characterization of Y-Nups thus obtained represents a step towards a better
understanding of the functional versatility of this key nuclear pore
subcomplex. In summary, our results are consistent with the
emerging view that Y-Nups, rather than serving as inert components of
the nuclear pore, are actually functionally diverse and possess
unexpected moonlighting functions 5,46,47.

Methods 

Data collection. Proteins from the D. melanogaster nuclear pore complex considered
as stoichiometrically assembled Y-Nups (i.e. explicitly excluding ELYS) were
collected and tabulated (Supplementary Table 1). We maintain the order according to
previous reports 10.

Sequence filtering & searching. All sequences were masked using CAST 48 with score
>15 and otherwise default parameters, to exclude subtle compositional bias,
including well-known repeats found in these proteins (Data Supplement DS01). In
total, 160 regions were filtered out for such elements. These low-complexity,
compositionally biased regions are provided separately, for further study (Data
Supplement DS02).

The masked sequences were used as queries against the non-redundant protein
sequence database (NRDB) at NCBI (15,052,178 entries) 49 with BLAST (e-value
cut-off threshold 10^-6) 50. Furthermore, these searches were manually executed with
PSI-BLAST with a variable number of iterations until convergence (PSI-BLAST
parameters: e-value cut-off threshold 10^-4, 500 alignments, CAST score >15)
(Supplementary Table 1), in particular to delineate possible anomalies such as
multi-domain structure (Data Supplement DS03). Results from the above searches were
evaluated (and confirmed with reverse sequence searches, not shown) and
multi-domain similarities were extracted for subsequent analysis (for similarity
distributions see Data Supplement DS04). Validity of domain associations was assessed by
searching with linker sequences (Data Supplement DS05) against nucleotide
databases, as a proxy for visual inspection of genome browser tracks; linkers were
extracted and searched against these data collections within boundaries of ±20 amino
acid residues where possible (Data Supplement DS06), and associated domains were
separately extracted (Data Supplement DS07) and examined for taxonomic
distribution (Data Supplement DS08). Multiple alignments were extracted and visualized
by JalView 51, applying redundancy elimination interactively until visually clear
multiple alignments were obtained (Data Supplement DS03).
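
As an illustration of the linker extraction step, the sketch below returns the region between two adjacent domains together with up to 20 flanking residues on each side, clipped at the sequence ends ("where possible"). This is one plausible reading of the ±20 residue boundaries. The function name `linker_with_flanks`, the coordinate convention and the toy sequence are assumptions made for this example, not the procedure used to produce Data Supplements DS05-06.

```python
def linker_with_flanks(sequence, domain_a_end, domain_b_start, flank=20):
    """Return the inter-domain linker plus up to `flank` residues on each side.

    Coordinates are 0-based; `domain_a_end` is exclusive and `domain_b_start`
    is inclusive, so the linker itself is sequence[domain_a_end:domain_b_start].
    """
    if domain_b_start < domain_a_end:
        raise ValueError("domains overlap; there is no linker to extract")
    start = max(0, domain_a_end - flank)               # clip at the N-terminus
    end = min(len(sequence), domain_b_start + flank)   # clip at the C-terminus
    return sequence[start:end]


# Toy composite protein: 50-residue domain, 10-residue linker, 50-residue domain.
toy_protein = "A" * 50 + "G" * 10 + "C" * 50
window = linker_with_flanks(toy_protein, domain_a_end=50, domain_b_start=60)
print(len(window), window)
```

The resulting window can then serve as a query against nucleotide databases: a hit spanning both flanks suggests that the two domains are encoded contiguously rather than joined by an assembly or gene prediction error.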

Clustering & annotation. All detected homologs, labeled accordingly, were compared
using BLAST in an all-against-all mode (e-value cut-off threshold 0.01), following
CAST masking as above. The pairwise similarity list was submitted to MCL sequence
clustering using an inflation value of 1.2; each cluster was assigned an incremental
integer identifier 12. Clusters are sorted by their size (number of members in a cluster,
Data Supplement DS09); thus, the largest clusters have the smallest integer identifiers
(groups with 2 or fewer members are omitted, namely 12 instances). These cases
(12/2962 or 0.4%) yield a sensitivity level of 99.6%. Conversely, two 'false' positives in
clusters C1 (Nup98, GI:262118708) and C2 (Nup75, GI:307191801) yield a specificity
level of 99.9% (under further investigation - Promponas et al, in preparation).
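
For readers unfamiliar with Markov clustering, the following is a minimal pure-Python sketch of the expansion and inflation cycle on a small similarity matrix, using the inflation value of 1.2 mentioned above. It is not the MCL program used in this study: the added self-loops, the convergence tolerance, the way clusters are read off the final flow matrix and the toy similarity matrix are all simplifying assumptions.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    result = [[0.0] * n for _ in range(n)]
    for j in range(n):
        column_sum = sum(matrix[i][j] for i in range(n))
        for i in range(n):
            result[i][j] = matrix[i][j] / column_sum if column_sum else 0.0
    return result


def multiply(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def markov_cluster(similarity, inflation=1.2, max_iterations=200, tolerance=1e-9):
    """Toy Markov clustering; returns clusters sorted by decreasing size."""
    n = len(similarity)
    # Adding self-loops is a common convention (an assumption in this sketch).
    flow = normalize_columns(
        [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)])
    for _ in range(max_iterations):
        expanded = multiply(flow, flow)  # expansion: two-step random walks
        inflated = normalize_columns(    # inflation: strengthen strong flows
            [[value ** inflation for value in row] for row in expanded])
        change = max(abs(inflated[i][j] - flow[i][j])
                     for i in range(n) for j in range(n))
        flow = inflated
        if change < tolerance:
            break

    # Nodes that still exchange flow are grouped together (union-find).
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(n):
            if flow[i][j] > 1e-6:
                parent[find(i)] = find(j)
    groups = {}
    for node in range(n):
        groups.setdefault(find(node), []).append(node)
    return sorted(groups.values(), key=lambda group: (-len(group), group[0]))


# Toy input: sequences 0-1 are mutually similar, as are sequences 2-4.
similarity = [
    [0, 1, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 1, 1, 0],
]
clusters = markov_cluster(similarity)
# Largest cluster receives the smallest identifier, as in the text.
cluster_ids = {index: members for index, members in enumerate(clusters, start=1)}
print(cluster_ids)
```

Sorting by decreasing size before numbering mirrors the convention described above, where the largest clusters carry the smallest integer identifiers.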

Expression profiles and protein interactions. Next-generation sequencing (NGS)
data for a wide range of human and mouse tissues and cell lines were extracted from
multiple available sources (Supplementary Table 4; for other species, data are not as
rich). Expression data for each instance were measured using cRPKM units [corrected
(for mappability) Reads Per Kilobase per Million mapped reads], calculated as
previously described 16,52. The orthologs from human and mouse were analyzed for
tissue-specific gene expression across all samples 16,17 (Data Supplement DS10).
Identification of cassette alternative exons and quantification of their transcript
inclusion levels across samples was performed as previously described 53 (see also: Hon
et al, submitted). Both sequence clusters and gene expression profiles (Figures 1 and
2) were visualized with Circos 54.
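
The unit itself can be written down compactly. The sketch below shows the standard RPKM formula and a mappability-corrected variant in which the transcript length is replaced by the number of uniquely mappable positions. The latter is our simplified reading of "corrected for mappability"; the exact procedure is the one described in the cited references. Function names and the example numbers are illustrative.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads."""
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and library size must be positive")
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def crpkm(read_count, mappable_positions, total_mapped_reads):
    """RPKM with the length replaced by the count of uniquely mappable
    positions (an assumed form of the mappability correction)."""
    return rpkm(read_count, mappable_positions, total_mapped_reads)


# Hypothetical transcript: 500 reads, 2000 bp long, 1600 mappable positions,
# in a library of 10 million mapped reads.
plain_value = rpkm(500, 2000, 10_000_000)
corrected_value = crpkm(500, 1600, 10_000_000)
print(plain_value, corrected_value)
```

Because fewer positions can receive reads, the corrected value is higher than the plain one for the same read count, which prevents transcripts with repetitive, poorly mappable regions from appearing under-expressed.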



Gene expression data for Y-Nups and associated domain homologs in human
(Supplementary Table 4) were subject to bootstrap rank correlation statistics
(Supplementary Table 3). Expression patterns from 300 randomly selected human
genes were systematically sampled 500 times with replacement for bootstrapping, in
subsets of 100 expression patterns. Each subset was merged with Y-Nup and
associated domain homolog expression profiles for Spearman rank correlation analysis,
and average ranks were recorded (Supplementary Table 3). The complete gene
expression dataset (human Y-Nups, human homologs of the 27 associated domains,
random sample of 300 human genes) was clustered based on Spearman rank
correlation coefficients (Figure 4).
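
The bootstrap procedure can be sketched as follows, using the numbers given above (300 background genes, 500 rounds, subsets of 100). How ranks are derived within each round is not specified in the text, so the scoring used here, the rank of a candidate profile among the sampled background profiles by Spearman correlation with a Y-Nup profile, is an assumption, as are all function names and the synthetic expression profiles.

```python
import math
import random


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either profile is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def bootstrap_average_rank(query, candidate, background,
                           rounds=500, subset_size=100, seed=0):
    """Average rank of `candidate` among resampled background profiles,
    ordered by Spearman correlation with `query` (rank 1 = best)."""
    rng = random.Random(seed)
    target = spearman(query, candidate)
    total = 0
    for _ in range(rounds):
        subset = [rng.choice(background) for _ in range(subset_size)]
        better = sum(1 for profile in subset if spearman(query, profile) > target)
        total += better + 1
    return total / rounds


# Synthetic data: six "tissues", one Y-Nup profile, two candidate profiles.
generator = random.Random(1)
background_profiles = [[generator.random() for _ in range(6)] for _ in range(300)]
ynup_profile = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
coordinated_profile = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
opposed_profile = [6.0, 5.0, 4.0, 3.0, 2.0, 1.0]

rank_coordinated = bootstrap_average_rank(ynup_profile, coordinated_profile,
                                          background_profiles)
rank_opposed = bootstrap_average_rank(ynup_profile, opposed_profile,
                                      background_profiles)
print(rank_coordinated, rank_opposed)
```

A candidate that is tightly coordinated with the Y-Nup keeps a low average rank across bootstrap rounds, whereas an uncoordinated or opposed profile is outranked by most random genes.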

Known protein interactions were extracted from the PINA database 55,56 and
annotated appropriately; these data were augmented by the discovered domain
associations (Data Supplement DS11a), and are made available in BioLayout 57 format
for visual exploration. Coordinated tissue-specific gene expression data enriched in
Y-Nups were used as a composite query to GeneMANIA 36, resulting in supporting
evidence from high-throughput experiments (Data Supplement DS11b). Note that
the PINA results are used only to reflect the current status of knowledge for Y-Nup
interactions, while the GeneMANIA results are used to discover and provide support
for the novel findings reported here.

Entire Y-Nup sequence compendium. All sequences detected by the above analysis
(2962 Y-Nups + 27 external domains = 2989 sequences) are labeled by property and domain
association and provided in FASTA format for further study and as a possible basis for a
more consistent nomenclature (Data Supplement DS12).

Data availability. All results (in 12 Data Supplements) are available as a ZIP archive 
(58.3 MBytes) on http://dx.doi.org/10.6084/m9.figshare.840452 